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Abstract 

Background: Epidemiology and ecology share many fundamental research questions. Here we describe how 
principal coordinates of neighbor matrices (PCNM), a method from spatial ecology, can be applied to spatial 
epidemiology. PCNM is based on geographical distances among sites and can be applied to any set of sites 
providing a good coverage of a study area. In the present study, PCNM eigenvectors corresponding to positive 
autocorrelation were used as explanatory variables in linear regressions to model incidences of eight most common 
cancer types in Finnish municipalities (n = 320). The dataset was provided by the Finnish Cancer Registry and it 
included altogether 615,839 cases between 1953 and 2010. 

Results: PCNM resulted in 165 vectors with a positive eigenvalue. The first PCNM vector corresponded to the 
wavelength of hundreds of kilometers as it contrasted two main subareas so that municipalities located in 
southwestern Finland had the highest positive site scores and those located in midwestern Finland had the highest 
negative scores in that vector. Correspondingly, the 165 th PCNM vector indicated variation mainly between the two 
small municipalities located in South Finland. The vectors explained 13-58% of the spatial variation in cancer 
incidences. The number of outliers having standardized residual > |3| was very low, one to six per model, and even 
lower, zero to two per model, according to Chauvenet's criterion. The spatial variation of prostate cancer was best 
captured (adjusted ^ = 0.579). 

Conclusions: PCNM can act as a complementary method to causal modeling to achieve a better understanding 
of the spatial structure of both the response and explanatory variables, and to assess the spatial importance of 
unmeasured explanatory factors. PCNM vectors can be used as proxies for demographics and causative agents 
to deal with autocorrelation, multicollinearity, and confounding variables. PCNM may help to extend spatial 
epidemiology to areas with limited availability of registers, improve cost-effectiveness, and aid in identifying 
unknown causative agents, and predict future trends in disease distributions and incidences. A large advantage 
of using PCNM is that it can create statistically valid reflectors of real predictors for disease incidence models with 
only little resources and background information. 
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Background 

Epidemiology and ecology share many fundamental re- 
search questions: why are the spatiotemporal distributions 
of diseases/species not uniform? What causes changes in 
the disease incidence/prevalence or species abundance? 
How can these changes be predicted? This common 
ground has generated a rapidly growing discipline, spatial 
epidemiology [1]. Although spatial epidemiology has made 
great progress in answering these questions [2], certain 
methodological problems remained unresolved. For ex- 
ample, including multiple explanatory variables in epi- 
demiological models often leads to multicollinearity [3], 
and adjusting for potential confounders may create an 
artificial connection between the outcome and exposures 
[4] causing both statistical and theoretical challenges. The 
purpose of this article is to describe and demonstrate 
how principal coordinates of neighbor matrices (PCNM), 
a method from spatial ecology, can be applied to spatial 
epidemiology to address these methodological problems. 

PCNM is based on geographical distances between dif- 
ferent sites and "can be applied to any set of sites provid- 
ing a good coverage of a geographical sampling area" [5]. 
PCNM variables model spatial relationships among sites 
in decreasing order of spatial scale. Originally, PCNM was 
developed to deal with induced spatial dependence be- 
tween statistical units due to effects of external processes. 
"True" spatial autocorrelation [5,6] refers to a situation 
when the response variable depends on itself due to in- 
ternal processes. The PCNM approach is rather simple: 
if spatial dependence cannot be avoided, it has to be 
considered as a source of information. PCNM searches 
for eigenvectors corresponding to positive autocorrel- 
ation and uses them to describe spatial structures in a 
given dataset [5,6]. 

The PCNM approach is closely related to Morans 7, an 
index of spatial autocorrelation [7]. Morans / is a correl- 
ation coefficient and it gives a linear relationships (auto- 
correlation) between dependent variables with respect to a 
spatial weighting matrix. There are several criteria to cre- 
ate the spatial weights [8]. PCNM can be seen as a par- 
ticular case of Moran s eigenvector maps where the spatial 
weighting matrix is defined with geographical (Euclidean) 
distances between study locations [6]. 

The PCNM eigenvectors, reflecting spatial scales in a 
given dataset are rotated so that they do not correlate 
with each other and sample locations can be randomly 
or systematically assigned. Consequently, PCNM statisti- 
cally avoids the multicollinearity problem. PCNM vec- 
tors can be considered as reflections of real predictors if 
1) each explanatory variable obeys one or more spatial 
patterns and 2) the geographic coverage of the sampling 
area is complete enough so that PCNM reflects most 
spatial patterns. Therefore, it is possible to create statis- 
tically valid models of disease prevalence and incidence 



by using PCNM vectors as explanatory variables instead 
of more explicit explanators of diseases, which in some 
cases can be extremely difficult to measure. PCNM vari- 
ables act as proxies for any kind of process resulting in 
spatial structuration of the response variable(s). When 
evaluating the quality of spatial patterns, it is essential to 
consider the sampling areas geographical coverage based 
on the study sites. Importantly, good coverage does not 
refer to the size of the area, and thus PCNM can be ap- 
plied to all spatial scales. 

PCNM has several advantages over autoregressive 
methods which handle autocorrelation by removing the 
spatial dependency between observations. In the case of 
disease incidence, PCNM enables structuring statistically 
reliable models of spatial variation in the incidence regard- 
ing the site i on the basis of its correspondences with 
other sites, as the spatial patterns of the incidence, as well 
as those of factors affecting the incidence, are all included 
in and explained by the vectors provided by PCNM [9]. 
The vectors produced by PCNM reflect cyclic variation in 
the outcome variable so that the first PCNM eigenvector 
corresponds to the broadest spatial scale indicating the 
spatial extent of the entire study area (large-scale vari- 
ation) and the last PCNM eigenvector corresponds to the 
finest spatial scale (small-scale variation). The PCNM vec- 
tors can be applied directly as explanatory variables to re- 
gression and canonical models [5,9]. 

The aim of this study is to present PCNM as a com- 
plementary technique, not the method of choice, to the 
most common and best validated methods for disease 
mapping, such as the Besag, York, and Mollie model 
[10] and Poisson-Gamma model(s) [11]. It is proposed 
that PCNM vectors can be used as proxies for demo- 
graphics and causative agents to deal with autocorrelation, 
multicollinearity, and confounding variables. PCNM may 
also provide a priori information for techniques based on 
Bayesian statistics. 

Methods 

Data 

Our example dataset consisted of eight most common 
cancer types in all Finnish municipalities (n = 320) be- 
tween 1953 and 2010. Cancer incidences were as follows: 
breast (age-adjusted incidence rate per 100,000 women 
was 96.6), prostate (men, 85.6), lung including trachea 
(men, 28.5; women, 12.7), colon (men, 16.1; women, 
12.7), skin melanoma (men, 13.9; women, 13.8), rectal 
(men, 11.6; women, 6.9), stomach (men, 6.8; women, 
3.9), and leukemia (men, 6.9; women, 5.9). Different can- 
cer types were dealt with separately because different 
cancers are different diseases and, therefore, it was 
highly expectable that they will at least partly follow dif- 
ferent spatial patterns [12]. 
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The main reason for the use of a large long-term data- 
set was to enable inclusion also the smallest municipal- 
ities in the analysis. In Finland, there are municipalities 
with so small population that new cancer cases are not 
found every year. The effect of "zero years" on the ana- 
lysis was diminished by pooling the data over as many 
years as possible and then using the number of new 
cases per person-year as the unit of incidence. The high 
number of Finnish municipalities together with their 
small sizes (median, 425; range, 6 - 15,053 km 2 ) provides 
an excellent coverage of the entire geographical area. 
The dataset contains no missing values. 

The dataset was provided and the research approved 
by the Finnish Cancer Registry (FCR), which maintains a 
continuously updated nation-wide database of the inci- 
dence of all primary cancers in Finland. Final coding is 
done by qualified secretaries and supervised by the 
Registry physician. Reporting of cancers is mandated by 
law and each cancer considered to be an independent 
new primary lesion is registered separately and all inde- 
pendent cancer processes are coded as separate entities. 
FCR collects data with very high accuracy and complete- 
ness [13]. Rare cancer types were excluded from the 
present study to prevent person recognition and because 
rare events typically create nonlinear relationships be- 
tween variables. 

Data for the study included the number of new cancer 
cases per year per municipality together with information 
concerning the sex structure of populations. The un- 
adjusted cancer incidence rate (cases per 1,000 person- 
years in general, per 1,000 women-years for breast cancer, 
and per 1,000 man-years for prostate cancer) instead of 
age-adjusted incidence rate was used as the dependent 
variable due to the methodological purposes. We expressly 
wanted to test how well spatial vectors alone explain the 
variation in disease incidences - without information 
concerning demographics and possible causative agents. 
This was important for two reasons: 1) many real pre- 
dictors of diseases are correlated with each other and 2) 
information about predictors may not be available. As 
both multicollinearity and the lack of information de- 
crease the explanatory power of traditional statistical 
models, PCNM can be performed to deal with these 
challenges. 

When applied to clinical purposes, PCNM vectors can 
be associated with demographics and possible causative 
agents of the target variable to search for reasons for the 
spatial patterns detected. This is simply done by model- 
ing demographics and causative agents by the same 
PCNM vectors which best explained the spatial distribu- 
tion of the target variable. In the present study, this was 
not done as the aim was to test the PCNM method first 
time in an epidemiologic context, not to find reasons for 
the spatial distribution of cancer in Finland. Furthermore, 



information concerning demographics and many sup- 
posed causative agents of cancer are available in Finnish 
Registries mainly from 1980s and 1990s, not from 1950s 
as cancer incidences. Consequently, a comprehensive 
comparison between spatial patterns of cancer and 
those of demographics and causative agents in Finland 
has to be carried out using a shorter-term dataset than 
in this study. This could be the next step in introducing 
the PCNM method to spatial epidemiology. 

If variables related to chronic diseases are explained by 
spatial patterns, it has to be taken into account that the 
study site may not be the site where the person in ques- 
tion lived when the disease started. This is important es- 
pecially at individual-level data, but may also affect results 
at group level if migration waves have been intense and/or 
long-lasting. From this point of view, Finland provides 
a good study area, as the propensity for intermunicipal 
migration in Finland is rather low [14] and two-thirds 
of Finnish people live in the county where they were 
born [15]. 

Spatial vectors 

A principal coordinate analysis of a truncated matrix of 
Euclidean distances between the 320 Finnish municipal- 
ities (as of January 1, 2013) was carried out to create the 
patterns of spatial scales. First, a 2-dimensional matrix 
of Euclidean distances (D) among the municipalities was 
conducted using the latitudes and longitudes of geo- 
graphical centers of settlement concentrations, such as 
cities, of the municipalities as initial values. The latitudes 
and longitudes were defined as meters from the equator 
and meters from the meridian 27° east of Greenwich, re- 
spectively. If the municipality had more than one settle- 
ment concentration, an average of geographical centers 
of these concentrations weighed by the number of in- 
habitants per concentration was calculated and consid- 
ered as the initial value. The geographical locations were 
determined by the National Land Survey of Finland. Sec- 
ond, a truncated connectivity matrix (W) was constructed 
according to the following rule: Wy = dy if dy = < t and 
Wij = 4£ if dy > t, where t is a threshold value indicating the 
maximum distance i.e. minimum spanning tree which 
maintains all sites (i.e. municipalities) being connected 
[5,6]. Third, eigenvectors were extracted from the cen- 
tered W. The PCNM results in vectors corresponding to 
positive and negative eigenvectors but only the positive 
ones are taken into account in further analyses. A good 
reconstruction of spatial structures is obtained by this 
method [5]. Negative eigenvectors model negative spatial 
correlation and they may be useful in some instances, but 
not in the present context. The PCNM vectors were cre- 
ated using functions of the "spacemakeR" package [6] for 
the R statistical language [16]. 
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Regression models 

Multiple linear regressions with forward stepwise pro- 
cedure were performed on the cancer incidences to test 
the explanatory power of the 165 PCNM vectors. Each 
regression model included one outcome variable (inci- 
dence of certain cancer type) and multiple explanatory 
variables (PCNM vectors). Normality of incidence rate 
distributions and residuals of the final models were 
assessed with the Kolmogorov-Smirnov test. If the distri- 
bution was non-normal (prostate cancer, skin melanoma, 
stomach cancer, and leukemia) the incidence rates were 
log 10 -transformed before linear regression analyses. The 
number of independent variables, which were able to 
improve the models more than expected by chance, was 
approximated on the basis of the adjusted coefficient of 
determination r 2 [17]. In practice, this meant that ex- 
planatory variables i.e. PCNM vectors with ^-value <0.05 
(corresponding to the alpha level stopping criterion) 
were added to the model as long as the adjusted r 2 ob- 
tained using all the PCNM vectors as explanatory vari- 
ables was reached. The "double stopping criterion" was 
necessary since using PCNM vectors as independent 
variables in a regression model may result in an inflated 
r 2 due to the fact that many vectors can reflect the same 
spatial process [17,18]. The relative significance of single 
spatial vectors in explaining the cancer incidences was 
determined on the basis of standardized coefficients. Re- 
sidual plots were drawn to evaluate the goodness-of-fit 
in the models. The regression analyses were performed 
with the IBM SPSS 19 for Windows. 

When applying the method to clinical purposes, it is 
highly important to associate the PCMN vectors also to 
actual explanatory variables before drawing conclusions, 
as the spatial processes causing the structures can be rep- 
resented by one or several PCNM vectors. In general, 
spatial processes should be modeled by a group of PCNM 
vectors, not by one vector at a time. 

Results 

PCNM vectors 

PCNM resulted in 165 vectors with a positive eigen- 
value. The threshold value t was 1,409 meters. It denotes 
the cycle wavelength of the PCNM vector which reflects 
the smallest-scale variation in the study area. Shorter 
wavelengths are possible only if more sample locations 
are offered to PCNM. As municipalities are irregularly 
arranged throughout Finland, it was impossible to deter- 
mine the exact cycle wavelength (spatial scale) of each 
PCNM vector, in the present case. In practice, however, 
the first PCNM vector corresponded to the wavelength 
of hundreds of kilometers as it contrasted two main sub- 
areas so that municipalities located in southwestern 
Finland had the most positive site scores and those lo- 
cated in midwestern Finland had the most negative 



scores in that vector. In Figure 1, the interpretation con- 
cerns the contrast between the dark and light areas. Cor- 
respondingly, the 165 th PCNM vector corresponded to t 
as it indicated variation mainly between the two small 
municipalities located in South Finland, Espoo (black 
area in Figure 1) and Kauniainen (white area in Figure 1) 
which is actually surrounded by Espoo. PCNM vectors 
can be regarded as representatives of cyclic variations 
and actual site scores as "correlation coefficients" between 
the variations and sample locations. A high absolute score 
value refers to a high correlation and a low absolute value 
to a low correlation between the variation and location in 
question. 

Regression models 

The PCNM vectors explained 13-58% of the spatial 
variation in cancer incidences according to the adjusted 
r 2 with the double stopping criterion (Table 1). The 
number of outliers with standardized residual > |3| was 
1-6 per model, which can be considered low as n in 
each case was 320. The number of outliers was even 
lower, 0-2 per model, when approximated from the nor- 
mal distribution according to the traditional Chauvenets 
criterion [19]. A few outliers were clearly influential, as 
they were located at the end of the scale far from the 
mean value of x (Additional file 1) and thus they af- 
fected slopes of the resulted regression models more 
than other values one at a time. Residuals were normally 
distributed in all cases except skin melanoma and rectal 
cancer, which also had the highest number of outliers 
(Additional file 1). The goodness-of-fit was thus low for 
the skin melanoma and rectal cancer models. 

The explanatory power was highest for the prostate 
cancer model (Figure 2) and lowest for the lung cancer 
model (Figure 3). Most models failed to estimate the inci- 
dences in individual municipalities in northern Finland, 
which was most probably due to a less comprehensive 
geographical coverage of the study area. Municipalities in 
northern Finland are large and sparsely populated and 
therefore the observed disease incidences do not represent 
the settlement concentrations of the municipalities as well 
as in middle and southern Finland. The small municipal- 
ities in Aland Archipelago and in the main island of Aland 
in southwestern Finland, however, were responsible for 
the residual extremums in the models. Standardized re- 
siduals > |4| were all connected to the municipalities in 
Aland as follows: lung and breast cancer in Sottunga 
(7.36 and 6.49, respectively), skin melanoma in Lumpar- 
land (7.62) and Sottunga (4.29), rectal cancer in Kokar 
(4.27) and Sottunga (-4.71), and colon cancer in Sund 
(4.25). Underestimated cancer incidences (high residuals) 
were much more common than overestimated incidences 
(low residuals). 
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Figure 1 PCNM patterns corresponding to the largest (eigenvector 1 on the left) and finest spatial scale (eigenvector 165 on the right) 
in the given data (320 municipalities in Finland). Dark color indicates high positive site score and light color high negative score in the vector. 



The spatial variation in prostate, breast, and colon 
cancer incidences were especially well explained by the 
vectors. The PCNM vectors 1, 6, 9, and 12 had the great- 
est explanatory power on the variation of cancer inci- 
dences in general and their standardized coefficients 
were > |0.2| in six, seven, eight, and six cases, respect- 
ively (Figure 4, Additional file 2). Although the inci- 
dences of nearly all cancer types were captured by the 
same spatial vectors with different combinations, some 



Cancer type 
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r 2 adjusted + 


Prostate 
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0.638 (22) 


0.579 (10) 


Breast 
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0.589 (25) 
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Rectal 
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0.463 


0.428 (20) 


0.302 (6) 


Leukemia 


0.616 


0.380 


0.331 (23) 


0.242 (11) 


Stomach 


0.623 


0.388 


0.356 (16) 


0.216 (4) 


Melanoma 


0.580 


0.337 


0.316 (10) 


0.130 (1) 


Lung 


0.574 


0.329 


0.289 (18) 


0.129 (4) 



*Based on the alpha level (p-value <0.05) stopping criterion; the number of 
explanatory vectors included in brackets. 

+ Based on the double stopping criterion; the number of explanatory vectors 
included in brackets. See text for a more detailed description of the method. 



variation was also captured by a few vectors that were 
cancer type-specific. Prostate, colon, and stomach can- 
cers were mainly associated with large-scale vectors, 
whereas breast cancer was related to large- and small- 
scale vectors, leukemia to scales of all sizes, and lung can- 
cer to vectors reflecting medium-sized scales (Additional 
files 1 and 2). 

Discussion 

What PCNM can bring to spatial epidemiology? 

This analysis demonstrates the usability of PCNM for 
producing statistically valid models of cancer incidence 
distribution. In the models, PCNM vectors can be used 
as proxies for demographics and causative agents to deal 
with autocorrelation, multicollinearity, and confounding 
variables. This, however, does not mean that PCNM vec- 
tors can replace explicit causative agents, when model- 
ing disease incidences. Rather, spatial modeling can act 
as a complementary method to causal modeling 1) to 
achieve a better understanding of the spatial structure of 
both the response and explanatory variables and 2) to 
assess the spatial importance of unmeasured and/or pre- 
viously unknown explanatory factors. 

Spatial structures are rarely simple enough to be cap- 
tured by a single PCNM vector. Grouping the PCNM 
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Figure 2 Observed and modeled incidences of prostate cancer (per 1 7 000 man-years) expressed in relation to observed minimum and 
maximum. Standardized residuals reflect the goodness-of-fit of the model in different areas. The arrow points the outlier. 



vectors by scale is the procedure to uncover underlying 
processes occurring at different spatial scales [5,9]. 
These vector combinations may help to interpret the re- 
sults as the most evident spatial processes are more 
clearly separated from each other. Our example was car- 
ried out in Finland, where many reasons for the non- 
uniform spatiotemporal distribution of different cancer 
types are known due to the high quality and large number 
of registers containing information about diseases them- 
selves, socio-demographics, and many health-related is- 
sues. Importantly, PCNM enables studying areas where 
the availability of information is restricted, and is appli- 
cable for assessing the spatial distribution of different 



diseases. Because the countries without comprehensive 
registers are often developing countries with inadequate 
health care, poor health conditions, and high prevalence 
of serious infectious diseases, benefits from this kind of 
epidemiological research can be very high. As sample 
locations can be randomly or systematically assigned, 
PCNM does not require a register as a data source, but 
(any kind of) findings which can be located on a map. 
In this case, a map refers to (any kind of) 2- or 3- 
dimensional area, such as country, city, and building. 
To maintain comparability between the corresponding 
PCNM variables, the locations have to remain the same 
through time. 
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Figure 3 Observed and modeled incidences of lung cancer (per 1,000 person-years) expressed in relation to observed minimum and 
maximum. Standardized residuals reflect the goodness-of-fit of the model in different areas. The arrow points the outlier. 



PCNM also improves the cost-effectiveness of epide- 
miologic research, as primary models can be created on 
the basis of unsupervised spatial vectors before collect- 
ing data on factors affecting the outcome. These pri- 
mary models can be utilized in first-line evaluations of 
the research impact. They also serve as excellent prede- 
cessors to more explicit research questions. In spatial 
epidemiology, the PCNM vectors can be used to identify 
possible causative agents of diseases. If the disease and 
the exposure obey the same pattern of spatial vectors, 
they have to be somehow connected. A connection itself 
is not a proof of causal relationship, but all previously un- 
known associations can be valuable pointers for further 



research. In addition to spatial similarities between dis- 
eases and their causative agents, areas where disease in- 
cidences do not follow modeled spatial patterns can be 
of interest. These "exceptions to the rule" -areas denote 
that an exposure specifically important to these areas is 
missing from the model. The technique known as vari- 
ation partitioning can be coupled to PCNM to identify 
the proportions of explained spatial and non-spatial 
variation [20]. 

PCNM may help to model and visualize changes in 
disease distributions. The areas created on the basis of 
PCNM vectors are not predecided but data-driven and 
thus flexible. This is a drastic contrast to methods based 
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Figure 4 Single spatial vectors which explained the incidences of different cancer types most. Areas with high positive and high negative 
site scores in the PCNM vector have been denoted by dark and light colors, respectively. The number after the name of the cancer type informs 
the correlation coefficient (Pearson's r) for the relationship between the incidence and the vector in question. 
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on preappointed permanent sites, such as the method 
currently used to visualize spatiotemporal variation in 
cancer in Finland [21]. This does not mean that the latter 
methods are inadequate, but highlights the applicability of 
PCNM. PCNM also enables generating models that do 
not aim to forecast each sample site but larger areas sim- 
ultaneously, which again improves cost-effectiveness. If 
certain sites are explained by the same spatial vectors the 
number of sample sites and/or sampling frequency can be 
lowered. 

Although the most evident solution to model spatio- 
temporal variation is to deal with the temporal and 
spatial effects as separate factors, methods for modeling 
non-separate space-time variation are also available 
[22,23]. Especially, Bayesian statistics provides solutions 
to model the spatial and temporal effects simultaneously 
[22,24], as well as to associate disease incidences with 
their causative agents [25]. In PCNM, it is possible to 
separate time and space, or to include time as a dimen- 
sion in the input matrix. The latter solution enables 
modeling of phenomena which are affected by variables 
with varying spatial patterns and/or temporal cycles. In 
the absence of replication, spatiotemporal structures can 
also be studied using PCNM variables by the analysis of 
variance technique [26]. In this study, we intentionally 
excluded temporal aspects of variation in cancer inci- 
dences because we expressly wanted to test PCNM in 
the context of spatial epidemiology and create easily in- 
terpretable vectors representing 2-dimensional space. 

Limitations of the present study and the PCNM method in 
general 

Variation in the vector combinations partly explained the 
differences in explanatory power of the regression models 
for different cancer types. If the incidence is explained 
mainly by PCNM vectors reflecting large-scale variation 
(the first PCNM variables), as in the case of prostate can- 
cer (Table 1, Additional file 2), the explanatory power of 
the resulted model is high per se. Each small-scale vector 
corresponding to variation in a highly restricted geograph- 
ical area in turn is able to explain only a small part of the 
total variation and thus a combination of several small- 
scale vectors is needed to reach the explanatory power of 
one large-scale vector, as was the case with leukemia 
(Table 1, Additional file 2). In other words, If the true 
spatial variation of the dependent variable is low (e.g., lung 
cancer; Figure 3) the explanatory power of the PCNM var- 
iables will also be low. The last PCNM vectors typically 
model case-specific peculiarities and they are of very lim- 
ited general interest. 

Migration may distort the connections between disease 
incidences and sites. The spatial patterns of observed dis- 
ease incidence do not necessarily reflect the spatial pat- 
terns of exposures that have initiated the development of 



the disease. Explaining the observed disease incidence 
on the basis of spatial patterns is unreliable and even 
unadvised if the dataset consists of individual-level data 
and has high migration rate. This is especially problem- 
atic in the case of diseases with slow progression rates. 
Moreover, if the spatial distribution is random rather 
than systematic it cannot be predicted on the basis of 
spatial vectors reflecting cyclic variation. 

Conclusions 

PCNM may help to extend spatial epidemiology to new 
areas, improve the cost-effectiveness via primary models 
and aid in identifying previously unknown causative 
agents and predict future trends in disease distributions 
and incidences. The models created with eigenvectors can 
be valuable as such to deal with autocorrelation, multicol- 
linearity, and confounding variables or as predecessors for 
more explicit research questions. PCNM will benefit espe- 
cifically early phase explorative research. A large advan- 
tage of using PCNM is that it can create statistically valid 
reflectors of real predictors for disease incidence models 
with only little resources and background information. 
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Additional file 1: Relationships between observed and predicted 
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